Mandarin Singing Voice Synthesis Based on Harmonic Plus Noise Model and Singing Expression Analysis
The purpose of this study is to investigate how humans interpret musical
scores expressively, and then design machines that sing like humans. We
consider six factors that have a strong influence on the expression of human
singing. The factors are related to the acoustic, phonetic, and musical
features of a real singing signal. Given real singing voices recorded following
the MIDI scores and lyrics, our analysis module can extract the expression
parameters from the real singing signals semi-automatically. The expression
parameters are used to control the singing voice synthesis (SVS) system for
Mandarin Chinese, which is based on the harmonic plus noise model (HNM). The
results of perceptual experiments show that integrating the expression factors
into the SVS system yields a notable improvement in perceptual naturalness,
clearness, and expressiveness. By one-to-one mapping of the real singing signal
and expression controls to the synthesizer, our SVS system can simulate the
interpretation of a real singer with the timbre of a speaker.
Comment: 8 pages, technical report
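The core of a harmonic plus noise model can be illustrated with a minimal sketch: the voiced part is a sum of sinusoids at integer multiples of the fundamental frequency, and the unvoiced part is an additive noise component. The function name, fixed harmonic amplitudes, and constant noise level below are illustrative assumptions; a real HNM synthesizer uses time-varying parameters and a shaped noise spectrum.

```python
import numpy as np

def hnm_synthesize(f0, harmonic_amps, noise_level, sr=16000, dur=0.5):
    # Harmonic part: sum of sinusoids at integer multiples of f0
    t = np.arange(int(sr * dur)) / sr
    harmonic = np.zeros_like(t)
    for k, amp in enumerate(harmonic_amps, start=1):
        harmonic += amp * np.sin(2 * np.pi * k * f0 * t)
    # Noise part: white noise scaled by a fixed level (a real HNM
    # would filter this noise with a time-varying spectral envelope)
    noise = noise_level * np.random.default_rng(0).standard_normal(len(t))
    return harmonic + noise

# Half a second of a 220 Hz tone with three harmonics and faint noise
y = hnm_synthesize(220.0, [1.0, 0.5, 0.25], 0.01)
```

In a full SVS system, the expression parameters extracted by the analysis module would drive the trajectories of `f0`, the harmonic amplitudes, and the noise level over time.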
Affective Music Information Retrieval
Much of the appeal of music lies in its power to convey emotions/moods and to
evoke them in listeners. In consequence, the past decade witnessed a growing
interest in modeling emotions from musical signals in the music information
retrieval (MIR) community. In this article, we present a novel generative
approach to music emotion modeling, with a specific focus on the
valence-arousal (VA) dimension model of emotion. The presented generative
model, called \emph{acoustic emotion Gaussians} (AEG), better accounts for the
subjectivity of emotion perception by the use of probability distributions.
Specifically, it learns from the emotion annotations of multiple subjects a
Gaussian mixture model in the VA space with prior constraints on the
corresponding acoustic features of the training music pieces. Such a
computational framework is technically sound, capable of learning in an online
fashion, and thus applicable to a variety of applications, including
user-independent (general) and user-dependent (personalized) emotion
recognition and emotion-based music retrieval. We report evaluations of the
aforementioned applications of AEG on a large-scale emotion-annotated corpus,
AMG1608, to demonstrate the effectiveness of AEG and to showcase how
evaluations are conducted for research on emotion-based MIR. Directions of
future work are also discussed.
Comment: 40 pages, 18 figures, 5 tables, author version
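The key idea behind modeling the subjectivity of emotion perception with probability distributions can be sketched in a few lines: treat the VA ratings a clip receives from multiple subjects as samples from a 2-D Gaussian, whose covariance captures rater disagreement. The ratings below are invented for illustration, and the full AEG model is a Gaussian mixture with priors tied to acoustic features, not the single Gaussian shown here.

```python
import numpy as np

# Hypothetical valence-arousal ratings of one clip from five subjects
va = np.array([[0.30, 0.70], [0.40, 0.62], [0.20, 0.78],
               [0.35, 0.66], [0.25, 0.72]])

mean = va.mean(axis=0)          # central emotion estimate for the clip
cov = np.cov(va, rowvar=False)  # spread encodes rater subjectivity

def gaussian_pdf(x, mean, cov):
    # Density of a 2-D Gaussian, used to score a candidate VA point
    d = x - mean
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)
```

A VA point near the annotation cloud receives a much higher density than a distant one, which is what makes distribution-based retrieval and recognition possible.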
Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders
An effective approach to non-parallel voice conversion (VC) is to utilize
deep neural networks (DNNs), specifically variational auto encoders (VAEs), to
model the latent structure of speech in an unsupervised manner. A previous
study has confirmed the effectiveness of VAE using the STRAIGHT spectra for
VC. However, VAE using other types of spectral features such as mel-cepstral
coefficients (MCCs), which are related to human perception and have been
widely used in VC, has not been properly investigated. Instead of using one
specific type of spectral feature, it is expected that VAE may benefit from
using multiple types of spectral features simultaneously, thereby improving
the capability of VAE for VC. To this end, we propose a novel VAE framework
(called cross-domain VAE, CDVAE) for VC. Specifically, the proposed framework
utilizes both STRAIGHT spectra and MCCs by explicitly regularizing multiple
objectives in order to constrain the behavior of the learned encoder and
decoder. Experimental results demonstrate that the proposed CDVAE framework
outperforms the conventional VAE framework in terms of subjective tests.
Comment: Accepted to ISCSLP 201
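The multiple-objective structure of a cross-domain setup can be sketched with within-domain and cross-domain reconstruction terms: a latent code inferred from either feature type must decode into both feature types. The linear maps below are deterministic stand-ins for the VAE encoder/decoder networks, the dimensions and names are assumptions, and the KL-divergence terms of a real VAE are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_sp, d_mcc, d_z = 513, 40, 16  # assumed feature and latent sizes

# Hypothetical linear encoder/decoder weights for each feature domain
enc_sp = rng.standard_normal((d_z, d_sp)) * 0.01
enc_mcc = rng.standard_normal((d_z, d_mcc)) * 0.01
dec_sp = rng.standard_normal((d_sp, d_z)) * 0.01
dec_mcc = rng.standard_normal((d_mcc, d_z)) * 0.01

def cdvae_losses(sp, mcc):
    # Each latent code must reconstruct BOTH feature domains, so the
    # within- and cross-domain paths regularize a shared latent space
    z_sp, z_mcc = enc_sp @ sp, enc_mcc @ mcc
    losses = {
        "sp->sp":   np.mean((dec_sp @ z_sp - sp) ** 2),
        "sp->mcc":  np.mean((dec_mcc @ z_sp - mcc) ** 2),
        "mcc->sp":  np.mean((dec_sp @ z_mcc - sp) ** 2),
        "mcc->mcc": np.mean((dec_mcc @ z_mcc - mcc) ** 2),
    }
    return sum(losses.values()), losses

sp_frame, mcc_frame = rng.standard_normal(d_sp), rng.standard_normal(d_mcc)
total, parts = cdvae_losses(sp_frame, mcc_frame)
```

Minimizing the sum of all four terms constrains the encoders and decoders jointly, which is the regularization idea the abstract describes.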
A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents
In recent years, statistical modeling approaches have steadily gained in popularity in the field of information retrieval. This article presents an HMM/N-gram-based retrieval approach for Mandarin spoken documents. The underlying characteristics and the various structures of this approach were extensively investigated and analyzed. The retrieval capabilities were verified by tests with word- and syllable-level indexing features and comparisons to the conventional vector-space model approach. To further improve the discrimination capabilities of the HMMs, both the expectation-maximization (EM) and minimum classification error (MCE) training algorithms were introduced in training. Fusion of information via indexing word- and syllable-level features was also investigated. The spoken document retrieval experiments were performed on the Topic Detection and Tracking Corpora (TDT-2 and TDT-3). Very encouraging retrieval performance was obtained.
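In this family of approaches, a document is ranked by the likelihood P(Q|D) of generating the query, with the document's unigram model interpolated with a corpus-level background model for smoothing. The sketch below shows only that unigram case; the function name, toy data, and interpolation weight are illustrative assumptions, and the actual system adds N-gram structure and discriminative (MCE) training.

```python
import math
from collections import Counter

def unigram_lm_score(query, doc, corpus, lam=0.7):
    # log P(Q|D): each query term is generated by a mixture of the
    # document unigram model and a corpus (background) model
    doc_counts, corpus_counts = Counter(doc), Counter(corpus)
    score = 0.0
    for w in query:
        p_doc = doc_counts[w] / len(doc)
        p_bg = corpus_counts[w] / len(corpus)
        score += math.log(lam * p_doc + (1 - lam) * p_bg + 1e-12)
    return score

# Toy example: the document containing the query terms scores higher
query = ["taipei", "weather"]
doc1 = ["taipei", "weather", "report", "today"]
doc2 = ["stock", "market", "news", "update"]
corpus = doc1 + doc2
s1, s2 = (unigram_lm_score(query, d, corpus) for d in (doc1, doc2))
```

The background interpolation keeps unseen query terms from zeroing out the whole score, which matters for noisy syllable-level indexing of spoken documents.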
A Study of Using Cepstrogram for Countermeasure Against Replay Attacks
In this paper, we investigate the properties of the cepstrogram and
demonstrate its effectiveness as a powerful feature for countermeasure against
replay attacks. Cepstrum analysis of replay attacks suggests that crucial
information for anti-spoofing against replay attacks may be retained in the
cepstrogram. Experimental results on the ASVspoof 2019 physical access (PA)
database demonstrate that, compared with other features, the cepstrogram
dominates in both single and fusion systems when building countermeasures
against replay attacks. Our LCNN-based single and fusion systems with the
cepstrogram feature outperform the corresponding LCNN-based systems without
using the cepstrogram feature and several state-of-the-art (SOTA) single and
fusion systems in the literature.
Comment: Submitted to INTERSPEECH 202
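A cepstrogram can be computed as the frame-wise real cepstrum of a signal: window each frame, take the log magnitude spectrum, and apply an inverse FFT, stacking the resulting cepstra over time. The function name and frame parameters below are assumptions for illustration, not the exact front-end configuration used in the paper.

```python
import numpy as np

def cepstrogram(x, frame_len=512, hop=256):
    # Frame-wise real cepstrum: IFFT of the log magnitude spectrum
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
        frames.append(np.fft.irfft(log_mag))  # one cepstrum per frame
    return np.array(frames)  # shape: (num_frames, frame_len)

# Toy input: a short noise burst standing in for an utterance
x = np.random.default_rng(0).standard_normal(4096)
C = cepstrogram(x)
```

Replay attacks pass through an extra playback-and-recording channel, and because channel effects are additive in the cepstral domain, such a representation is a plausible place for the discriminative cues the abstract refers to.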